Abstract

The  theory  of  Inductive  Inference  studies  the  problems  related  to 
finding  structural  descriptions  from  samples.  In  the  present  paper  we 
suggest  that  this  process  can  be  viewed  as  the  result  of  consecutive 
compressions  of  the  data  into  more  efficient  representations. 

We provide two ways to improve the encoding of an abstract text
and we show how they could constitute the basis of a learning
procedure. We test the ideas on a simple example.


1      Introduction 

One of the recognized problems in the theory of inductive inference is that of
comparing hypotheses [Angluin and Smith 1983, Section 5], that is, finding
criteria and quantitative measures to evaluate the fitness and the simplicity
of an explanation. The perfect candidate as a measure in an algorithmic
setting is, of course, the distance from Kolmogorov complexity, K. In this way
it is possible to derive a formal conceptualization of Occam's razor
[Koppel 1987]. Unfortunately, only its intuitive significance can be used in
actual applications. Its definition is nonconstructive and Kolmogorov himself
proved that K is not a partial recursive function [Zvonkin and Levine 1970,
Theorem 1.5]. Nonetheless the theoretical importance of this approach is
fundamental, and one is advised to look for a more constructive counterpart,
a measure of complexity based on what can be observed of a phenomenon.
The use of Shannon entropy, H, as a grammatical complexity measure
has been attempted [Cook et al. 1976], although only in the more natural
setting of stochastic grammars. In this paper we propose a more systematic
use of it, stimulated by results which exhibit H's profound relations with
Kolmogorov complexity [Zvonkin and Levine 1970, Chaitin 1975,
Leung-Yang-Cheong and Cover 1978, Cover 1985]. We will be able to establish
an operative link between Shannon entropy and Kolmogorov complexity
utilizing an elaboration on the concept of minimal description due to [Koppel
1987]: a description of a string S is a couple (P, D), where P is a program
and D is the data which, when given as input to a Universal Turing Machine,
would output S. A description is minimal if |P| + |D| is minimal. Intuitively,
P is a description of the string structure and D specifies that string among
the ones sharing the same structure. In our information-theoretical setting
P will represent the code used and D the encoding of S under the code P.
A Universal Interpreter will reconstruct S given (P, D).

Section 2 will describe two ways for improving the encoding of a text,
giving the condition under which they yield a gain and a quantitative
estimate of that gain.

In Section 3 the results obtained will be applied to the theory of
inductive inference. Current theoretical approaches rely on an enumeration of
the possible descriptions (usually grammars) according to an order imposed on
them by a complexity measure. The discussion of Section 2 will point out
a way, using Shannon entropy as complexity measure, to reverse the
enumeration, allowing us to look at the inductive process as one which
proceeds towards descriptions with lower and lower complexity.

Section  4  will  be  devoted  to  a  discussion  of  the  applicative  work,  which 
is  under  way,  that  this  conceptualization  allows. 

2      Encoding  a  Text 

Definitions  and  Notational  Conventions 

•  T a two-way infinite text, T_N a substring of T of length N.

•  A = {a_i}_{i=1}^{n} the alphabet of the text.

•  S = {s_j}_{j=1}^{m} a syllabary for T, that is, a set of strings of symbols
in A such that T = ... s_{i_{-1}} s_{i_0} s_{i_1} ...; T can be viewed as a
text on the alphabet S in a unique way, i.e. S is a uniquely decipherable code
for T. In particular A is a syllabary for T.

•  C = {c_i}_i a classification, i.e. a partition of A, hence \sum_i |c_i| = n.

•  \bar{S}^C = {\bar{s}_j}_{j=1}^{m'} the reduced syllabary under the
classification C. To each s ∈ S, s = a_{j_1} ... a_{j_l}, corresponds an
\bar{s} ∈ \bar{S}^C, \bar{s} = c_{j_1} ... c_{j_l}, where c_{j_h} is
the class of a_{j_h}. Each element \bar{s} ∈ \bar{S}^C is a class of
syllables of S.

•  l_S(s) the length of the string s over the alphabet S. If S is understood,
the subscript can be omitted.

•  p_S(s) the probability of occurrence of the syllable s ∈ S in the text T,
that is, its relative frequency with respect to the syllabary S. The
subscript is omitted when there is no ambiguity about the syllabary
that should be considered.

•  p(·) the probability of the event in the argument.

•  log x the logarithm in base 2 of x.

Example 1 Consider the language L = ((ab)* ∪ (cd)*) · (ee)* and let T be
a possible sequence of juxtapositions of words in L. Then A = {a, b, c, d, e}
is the alphabet of T, S = {ab, cd, ee} is a syllabary for T, C = {c_1, c_2, c_3},
where c_1 = {a, c}, c_2 = {b, d}, c_3 = {e}, is a classification, and the
associated reduced syllabary is \bar{S}^C = {c_1 c_2, c_3 c_3}. The original
language could be expressed as \bar{L} = (c_1 c_2)* · (c_3 c_3)*.
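The reduction of Example 1 can be sketched in a few lines of Python. This is only an illustration: the names `classification`, `class_of` and `reduce_syllable` are hypothetical, and syllables are represented as plain strings over A.

```python
# Classification C of Example 1: a partition of the alphabet A = {a, b, c, d, e}.
classification = {"c1": {"a", "c"}, "c2": {"b", "d"}, "c3": {"e"}}

# Map every alphabet symbol to the name of its class.
class_of = {sym: name for name, members in classification.items() for sym in members}


def reduce_syllable(syllable):
    """Replace each symbol of a syllable by the class it belongs to."""
    return tuple(class_of[sym] for sym in syllable)


syllabary = ["ab", "cd", "ee"]
# Syllables with the same image form one element of the reduced syllabary.
reduced_syllabary = {reduce_syllable(s) for s in syllabary}
```

Here `ab` and `cd` collapse onto the same reduced syllable, so the three syllables of S yield two reduced syllables.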


Shannon  Entropy 

We are presented with an infinite text T and the problem of finding the
optimal encoding in bits. If T is encoded considering only the lowest-level
alphabet A, the lower bound for the average length per character of an optimal
uniquely decipherable encoding P(A) is

    H(A) = -\sum_{i=1}^{n} p(a_i) \log p(a_i)

A substring T_N of length N of T would have an encoding D(A) with an
average length in bits

    |D(A)| = N \cdot H(A) + |P(A)| = -N \sum_{i=1}^{n} p(a_i) \log p(a_i) + |P(A)|

where |P(A)| is intended as the length of the code table for the symbols.
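As an illustration, the entropy H(A) can be estimated from a finite sample by using relative frequencies as probabilities. The sketch below assumes exactly that; the function name `entropy` and the sample string are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(units):
    """Shannon entropy, in bits per unit, of the empirical distribution of units."""
    counts = Counter(units)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# A short sample over the alphabet {a, b, c, d, e}; each character is one unit.
sample = "ababcdee"
h_a = entropy(sample)

# Lower bound, in bits, for encoding the sample (code table |P(A)| ignored).
bits_lower_bound = len(sample) * h_a
```

For a finite sample the table length |P(A)| is not negligible; it is left out here only to keep the sketch short.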


Encoding  through  a  Syllabary 

We now want to investigate under which conditions an encoding done
under a syllabary is convenient with respect to the one done under the
low-level alphabet.

Let S be a syllabary for T. We have that:
The expected length of a syllable in S is

    \langle |s| \rangle = \sum_{j=1}^{m} p(s_j) l(s_j)

The expected number of syllables in T_N is

    N' = N / \langle |s| \rangle

The expected length in bits of the code for a syllable under an optimal
encoding P(S) is

    H(S) = -\sum_{j=1}^{m} p(s_j) \log p(s_j)

Under the syllabary S, the expected number of bits in which T_N would
be encoded is

    |D(S)| = N' \cdot H(S) + |P(S)|
           = -N \frac{\sum_{j=1}^{m} p(s_j) \log p(s_j)}{\sum_{j=1}^{m} p(s_j) l(s_j)} + |P(S)|       (1)

Let us define

    \lambda_S = \frac{H(S)}{\langle |s| \rangle}
              = -\frac{\sum_{j=1}^{m} p(s_j) \log p(s_j)}{\sum_{j=1}^{m} p(s_j) l(s_j)}

For large N the term |P(S)| can be ignored, hence \lambda_S can be thought
of as the compressing capacity of the syllabary S.

The gain in compression obtained under the encoding S is

    \frac{\lambda_S}{\lambda_A} = \lim_{N \to \infty} \frac{|D(S)|}{|D(A)|}
        = \frac{\sum_{j=1}^{m} p(s_j) \log p(s_j)}{\sum_{j=1}^{m} p(s_j) l(s_j) \cdot \sum_{i=1}^{n} p(a_i) \log p(a_i)}

The encoding under the syllabary S is convenient whenever \lambda_S / \lambda_A < 1
and it is optimal when \lambda_S is minimal. (If we are dealing with a finite text
then the limit operation will not apply and we will also have to consider the
length of the syllable table P(S).)
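The comparison between a syllabary and the low-level alphabet can be tried on a toy text. The sketch below assumes that the compressing capacity is the entropy per syllable divided by the expected syllable length, as equation (1) suggests, and that probabilities are estimated by relative frequencies. The function name `compressing_capacity` and the sample text are hypothetical.

```python
from collections import Counter
from math import log2


def compressing_capacity(units):
    """Entropy per unit divided by the mean unit length (bits per alphabet symbol)."""
    counts = Counter(units)
    total = sum(counts.values())
    probs = {u: c / total for u, c in counts.items()}
    h = -sum(p * log2(p) for p in probs.values())
    mean_len = sum(p * len(u) for u, p in probs.items())
    return h / mean_len


# The same text seen as a sequence of syllables and as a sequence of characters.
text_as_syllables = ["ab", "ab", "cd", "ee", "ab", "cd", "ee", "ee"]
text_as_symbols = list("".join(text_as_syllables))

lambda_s = compressing_capacity(text_as_syllables)
lambda_a = compressing_capacity(text_as_symbols)
gain = lambda_s / lambda_a  # below 1 means the syllabary is convenient
```

On this sample the syllabary {ab, cd, ee} needs fewer bits per alphabet symbol than the character-level encoding, so the ratio falls below 1.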


Example 2 Let us assume that p(a) = p(a') ∀ a, a' ∈ A and p(s) = p(s')
∀ s, s' ∈ S. From our point of view, this is the situation when no
information about the frequency of occurrence of alphabet symbols and syllables
is available or, equivalently, when in a particular application its retrieval
is not advised. It is well known that this assumption would maximize the
entropy function and hence the average length per character of the code.

Let us also denote by \bar{s} = \sum_{j=1}^{m} p(s_j) l(s_j) the average length
of a syllable in S. Our expression for \lambda_S becomes

    \lambda_S = \frac{\log m}{\bar{s}}

Under these assumptions \lambda_A = \log n, so it is convenient to use the
syllabary whenever n^{\bar{s}} > m.
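The condition of Example 2 can be checked numerically. The sketch below assumes equiprobable symbols and syllables, so that the character-level capacity is log n and the syllabary capacity is log m divided by the average syllable length. The function name `uniform_gain` and the sample values are hypothetical.

```python
from math import log2


def uniform_gain(n, m, mean_len):
    """Ratio of syllabary capacity to alphabet capacity when all
    symbols and all syllables are equiprobable."""
    lambda_s = log2(m) / mean_len
    lambda_a = log2(n)
    return lambda_s / lambda_a


# 4 symbols, 8 syllables of average length 2: 4**2 = 16 > 8, so a gain is expected.
convenient = uniform_gain(n=4, m=8, mean_len=2)
# 2 symbols, 8 syllables of average length 2: 2**2 = 4 < 8, so no gain is expected.
not_convenient = uniform_gain(n=2, m=8, mean_len=2)
```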

Example 3 Let T be a periodic text, that is, the repetition of the same string s.
Then there is a syllabary, namely S = {s}, which would make the compressing
factor \lambda_S = 0. The gain obtained through S would be, in fact, infinite:
we would be able to give a finite description of an infinite text.

Encoding  through  a  Classification 

A similar line of thought will now guide us in discovering that in encoding
a text T there can be a gain in classifying the symbols of an alphabet, taking
into account the syllabic structure of T. Under the classification C we have:
The average length of the syllables in S and the average length
\langle |s| \rangle of the reduced syllables in \bar{S}^C are the same,

    \langle |s| \rangle = \sum_{j=1}^{m} p(s_j) l_A(s_j) = \sum_{j=1}^{m'} p(\bar{s}_j) l_C(\bar{s}_j)

As a consequence, the average number N' of syllables in S and of reduced
syllables in \bar{S}^C of a substring T_N of length N of T is the same. The
average length of the code for a reduced syllable is

    H(\bar{S}^C) = -\sum_{j=1}^{m'} p(\bar{s}_j) \log p(\bar{s}_j)

The average information required to determine a syllable s out of its class
\bar{s} is H(S,C) bits, where

    H(S,C) = -\sum_{j=1}^{m'} p(\bar{s}_j) \sum_{h=1} \sum_{l=1} p(c_{j_h} = a_{j_l}) \log p(c_{j_h} = a_{j_l})


Let P(C) be the table for the classification C; then under the induced
classification \bar{S}^C the average length in bits of the encoding of T_N
would be

    |D(\bar{S}^C)| = N' \cdot (H(\bar{S}^C) + H(S,C)) + |P(C)|

The gain of the induced classification over the syllabary S is

    \lambda_S^C = \lim_{N \to \infty} \frac{|D(\bar{S}^C)|}{|D(S)|}

hence

    \lambda_S^C = \frac{H(\bar{S}^C) + H(S,C)}{-\sum_{j=1}^{m} p(s_j) \log p(s_j)}

As before there is an effective gain only when \lambda_S^C < 1 and the maximum
gain is reached when \lambda_S^C is minimal.

The overall gain of the encoding by means of the syllabary S and the
classification C, relative to the encoding under the alphabet A, is

    \lambda_{S,C} = \lambda_S^C \cdot \lambda_S / \lambda_A

It can happen that \lambda_{S,C} < 1 even if \lambda_S / \lambda_A > 1. The right
classification can increase the value of an otherwise useless syllabary.

Example 4 Let c be the average number of elements in the classes of C
and \bar{s} the average length of a syllable. As in Example 2, let us assume
that no information about probability is available. Then

    \lambda_S^C = \frac{\log m' + \bar{s} \log c}{\log m}

and

    \lambda_{S,C} = \frac{\log m' + \bar{s} \log c}{\bar{s} \log n}

A classification is convenient when m' < m / c^{\bar{s}}. The combination of a
syllabary and a classification is convenient when m' < (n / c)^{\bar{s}}.
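The two expressions of Example 4 are easy to evaluate. The sketch below simply transcribes them under the same no-information assumption; the function names and the sample values (4 symbols in 2 classes of 2, 8 syllables of length 2 falling into 2 reduced syllables) are hypothetical.

```python
from math import log2


def classification_gain(m, m_reduced, mean_len, class_size):
    """Gain of the classification over the syllabary, equiprobable case:
    (log m' + s log c) / log m."""
    return (log2(m_reduced) + mean_len * log2(class_size)) / log2(m)


def combined_gain(n, m_reduced, mean_len, class_size):
    """Gain of syllabary plus classification over the alphabet, equiprobable case:
    (log m' + s log c) / (s log n)."""
    return (log2(m_reduced) + mean_len * log2(class_size)) / (mean_len * log2(n))


g_class = classification_gain(m=8, m_reduced=2, mean_len=2, class_size=2)
g_total = combined_gain(n=4, m_reduced=2, mean_len=2, class_size=2)
```

With these values the classification alone gives no gain over the syllabary, while the combination is convenient with respect to the alphabet.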


3      Inductive  Inference 

We have seen that in some situations a shift in point of view can lower the
length of the encoding of a finite substring of T or, in other terms, can
lower its complexity (or, better, its perceived complexity). We can view a
learning procedure as any one which operates such shifts.

In the previous section we have given two ways to produce a change in
point of view. We like to parallel them to the two processes of analysis,
in which the right code words constituting a text are discovered, and of
synthesis, in which the right classification of those symbols is found (right,
in the light of what was observed in Section 2, stands for most convenient).

We  can  now  outline  the  procedure  which  immediately  follows  our  con- 
siderations: 


Algorithm  1 

1  Set i ← 0 and (S, C) = (A, A)

2  Set (S_i, C_i) = (S, C); i = i + 1

3  Choose (S, C), where S is a syllabary with |s| < l ∀ s ∈ S, and C a
classification, such that \lambda_{S,C} is minimal for T viewed as a text on
the alphabet \bar{S}_{i-1}^{C_{i-1}}

4  If \lambda_{S,C} > 1 then stop

5  Go to step 2

This algorithm will construct a hierarchy (S_i, C_i) with possibly infinitely
many levels, which constitutes the structure of the text. We need to enforce
the artificial limiting condition on the maximum size l of the syllables
sought, to ensure the termination of Step 3. Different values of l might lead
to very different results. For any value of l there exists a structured text
whose structure cannot be discovered by Algorithm 1 limited by l.

It is obvious that in actual applications of Algorithm 1, one need not
push the choice of Step 3 to the extreme of optimality. One could as well
substitute Steps 3 and 4 with

3'  Choose (S, C), where S is a syllabary and C a classification, such that
\lambda_{S,C} / \lambda_{S_{i-1},C_{i-1}} < 1 if it exists, else stop.


In that situation the procedure will still "learn" something, and its
effectiveness will be related to how close the choice in Step 3' is to the
optimal one. In this way we could widen the actual applicability of
Algorithm 1, originally restricted by the high complexity of Step 3. We could,
in fact, use more enlightened procedures than a blind exhaustive search over
all the possible syllabaries. Extensive work, from which the approach of the
present paper stems, has already been done to tackle this problem [Caianiello
and Capocelli 1971, 1976]. If one limits the search only to uniquely
decipherable codes the task is considerably easier. [Caianiello and Capocelli
1976] give four different algorithms, each one seeking a different type of
code. All of the algorithms show adaptive characteristics: they identify in
the limit [Gold 1967] the structure sought.
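A much simplified sketch of the relaxed choice in Step 3' is given below. It is not the procedure of [Caianiello and Capocelli 1976]: it ignores the classification, takes a list of candidate syllabaries as given, segments the text by greedy longest match (an assumption of this sketch), and accepts the first candidate whose capacity is lower than that of the character-level encoding. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def capacity(units):
    """Entropy per unit divided by mean unit length (bits per character)."""
    counts = Counter(units)
    total = sum(counts.values())
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    mean_len = sum((c / total) * len(u) for u, c in counts.items())
    return h / mean_len


def parse(text, syllabary):
    """Greedy longest-match segmentation; None if the text cannot be covered."""
    ordered = sorted(syllabary, key=len, reverse=True)
    out, pos = [], 0
    while pos < len(text):
        for s in ordered:
            if text.startswith(s, pos):
                out.append(s)
                pos += len(s)
                break
        else:
            return None  # no syllable matches at this position
    return out


def first_improving(text, candidates):
    """Return the first candidate syllabary that lowers the capacity, else None."""
    baseline = capacity(text)  # the text seen as a sequence of characters
    if baseline == 0:
        return None  # a one-symbol text cannot be improved by this test
    for syllabary in candidates:
        parsed = parse(text, syllabary)
        if parsed is not None and capacity(parsed) / baseline < 1:
            return syllabary
    return None


text = "ababcdeeabcdee"
chosen = first_improving(text, [{"abc"}, {"ab", "cd", "ee"}])
```

The first candidate is rejected because it cannot cover the text; the second one lowers the capacity and is accepted. Iterating this choice on the re-segmented text would give the successive levels of the hierarchy.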

The  search  for  comparably  better  classification  procedures  is  currently 
under  way. 

4      Experimental Work and Applications: Grammatical Inference

The procedure outlined in Section 3 by Algorithm 1 provides a constructive
method to achieve grammatical inference which, of course, needs to be
refined according to the needs of the specific applications. However, there
is still the need for much work in order to formalize all the necessary steps
towards the formulation of a solid general project.

Before undertaking such an effort one might feel the necessity to test
the validity of the ideas involved. The first kind of experiment that
one can think of is to check how other existing approaches fit in the scheme
proposed. Let us consider a text T consisting of a set of 100 words randomly
generated using the following grammar, a simplified version of the one given
by [Grishman 1986, Section 2.4.1].


<S>     ::=  <SUBJ> <VERB> <OBJ>
<SUBJ>  ::=  <NSTG>
<PN>    ::=  P <NSTG>
<NSTG>  ::=  <LNR>
<LNR>   ::=  <LN> N <RN>
<LN>    ::=  <TPOS> <APOS>
<TPOS>  ::=  T | null
<APOS>  ::=  ADJ | null
<RN>    ::=  <PN> | null
<VERB>  ::=  <LTVR>
<LTVR>  ::=  <LV> TV <RV>
<LV>    ::=  D | null
<RV>    ::=  D | <PN> | null
<LVR>   ::=  <LV> V <RV>
<OBJ>   ::=  <NSTG> | <TOVO> | null
<TOVO>  ::=  to <LVR> <OBJ>
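A sample of this kind can be produced with a short generator. The sketch below is only an assumption about how such a text might be generated: it encodes the productions listed above, picks each alternative with equal probability (the paper does not state the probabilities used), and represents null as an empty production. The names `GRAMMAR`, `TERMINALS` and `generate` are hypothetical.

```python
import random

# Productions of the grammar; an empty list stands for the null alternative.
GRAMMAR = {
    "S": [["SUBJ", "VERB", "OBJ"]],
    "SUBJ": [["NSTG"]],
    "PN": [["P", "NSTG"]],
    "NSTG": [["LNR"]],
    "LNR": [["LN", "N", "RN"]],
    "LN": [["TPOS", "APOS"]],
    "TPOS": [["T"], []],
    "APOS": [["ADJ"], []],
    "RN": [["PN"], []],
    "VERB": [["LTVR"]],
    "LTVR": [["LV", "TV", "RV"]],
    "LV": [["D"], []],
    "RV": [["D"], ["PN"], []],
    "LVR": [["LV", "V", "RV"]],
    "OBJ": [["NSTG"], ["TOVO"], []],
    "TOVO": [["to", "LVR", "OBJ"]],
}
TERMINALS = {"P", "N", "T", "ADJ", "TV", "D", "V", "to"}


def generate(symbol, rng):
    """Expand a symbol into a list of terminals, choosing alternatives uniformly."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal symbol
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for sym in production:
        words.extend(generate(sym, rng))
    return words


rng = random.Random(0)  # fixed seed so the sample is reproducible
sentences = [generate("S", rng) for _ in range(100)]
```

Every generated sentence contains at least one N (from the subject) and exactly one TV (from the verb), since those productions have no alternative.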


If  we  run  twice  the  algorithm  N  in  [Caianiello  and  Capocelli  1976]  on 
such  a  set  of  sentences  (which  corresponds  to  two  iterations  of  Steps  1,2  and 
3  of  Algorithm  1,  hence  to  finding  the  second  level  structure),  we  get  the 
syllabary 


S = { (to V),      (P T N),     (T ADJ N),
      (P ADJ N),   (to D V),    (ADJ N),
      (N),         (T N),       (P T ADJ N),
      (P N),       (D),         (TV) }


In the following table we can compare the values of the quantities H,
\langle |s| \rangle and \lambda for the alphabet A of the terminal symbols of
the grammar, for S, and for S_2, the syllabary consisting of all the digrams
appearing in the text and all the sentence-ending monograms (to guarantee
parsing of sentences with an odd number of components).


         H        \langle |s| \rangle    \lambda

A        2.701    1                      2.701

S        3.387    2.02                   1.676

S_2      3.254    1.49                   2.183
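The three columns of the table can be cross-checked against each other, assuming that the last column is the entropy divided by the average syllable length. The sketch below recomputes it from the reported values; the variable names are hypothetical and small discrepancies are expected because the table entries are rounded.

```python
# Reported values: (H, average syllable length, lambda) for each encoding.
table = {
    "A": (2.701, 1.0, 2.701),
    "S": (3.387, 2.02, 1.676),
    "S2": (3.254, 1.49, 2.183),
}

# Recompute lambda as H divided by the average syllable length.
recomputed = {name: h / mean_len for name, (h, mean_len, _) in table.items()}

# Gain of each syllabary relative to the alphabet A (below 1 means convenient).
gain_over_alphabet = {
    name: lam / table["A"][2] for name, (_, _, lam) in table.items() if name != "A"
}
```

Both syllabaries are convenient with respect to the alphabet, and the ratio for S is clearly lower than the one for the digram syllabary S_2.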

As we see, there is an appreciable difference in the gain obtained using
algorithm N. Moreover, if we notice that the grammar chosen is one which
describes (even if incompletely) the word-class phrase structure of English,
we find that the syllables discovered by algorithm N coincide with the ones
which a person (not only a linguist) would normally consider to be the
elementary strings of the phrase structure. The results obtained by [Caianiello
and Capocelli 1971, 1976] in discovering syllables in Italian texts and the one
presented here allow some conjectures which will guide our future work:
\lambda reaches a local minimum over syllabaries which could be and have been
discovered by means of other "reasonable" considerations. The process which
guides language use is the same at different levels, and linguistic ability is
the superimposition of the same operations on different substrata.